npj Digital Medicine
Top medRxiv preprints most likely to be published in this journal, ranked by match strength.
Generative artificial intelligence (AI) is rapidly populating medical records with synthetic or partially AI-generated content, creating a feedback loop where future models are increasingly at risk of training on uncurated AI-generated data. However, the clinical consequences of this AI-generated data contamination remain unexplored. Here, we show that in the absence of mandatory human verification, this self-referential cycle drives a rapid erosion of pathological variability and diagnostic rel...
Background: Diagnostic errors are a leading cause of preventable patient harm, often occurring during early clinical encounters where diagnostic uncertainty is maximal. Large language models (LLMs) have shown potential in medical reasoning, yet their ability to function as a diagnostic safety net, specifically by identifying and correcting human diagnostic errors, remains systematically unquantified. We evaluated whether state-of-the-art LLMs can effectively challenge, rather than merely confirm, ...
Pathology faces persistent challenges including a global shortage of specialists, uneven access to expertise, increasing diagnostic complexity, and a growing need for second-opinion consultations. While digital and telepathology platforms address parts of this problem, existing solutions often trade accessibility for structured, workflow-aware clinical integration. At the same time, multimodal medical AI shows promise for diagnostic support but raises concerns regarding transparency, automation ...
Chronic wounds affect over 1.2 million Canadians and incur healthcare costs exceeding $13 billion annually, with global expenditures approaching $149 billion. Current clinical practice relies on manual measurements and subjective visual evaluations, which overestimate wound area by up to 40% and demonstrate poor-to-moderate inter-rater reliability. This variability complicates longitudinal monitoring and evidence-based treatment selection. We developed and evaluated an integrated mobile platform...
Traditional surgical training relies heavily on hands-on experiences gained through relatively infrequent procedures during apprenticeships. Recently, postoperative review has become a valuable supplement to this model, offering learning opportunities outside the operating room. However, its adoption remains limited due to its inefficiencies. In this study, we developed a Computer Vision-based system designed to efficiently navigate and retrieve critical segments from laparoscopic cholecystectom...
Medicine historically separates abstract clinical reasoning from physical intervention. We bridge this divide with MedOS, a general-purpose embodied world model. Mimicking human cognition via a dual-system architecture, MedOS demonstrates superior reasoning on biomedical benchmarks and autonomously executes complex clinical research. To extend this intelligence physically, the system simulates medical procedures as a physics-aware model to foresee adverse events. Generating and validating on the...
Background: The assessment of physical examination skills in medical education is resource-intensive and prone to inter-rater variability. While artificial intelligence (AI) has successfully automated the grading of clinical notes and transcripts, evaluating the physical techniques themselves--what students do rather than what they say--remains an unsolved challenge. We evaluated whether a multimodal AI system could assess physical examination skills with expert-level reliability. Methods: In this ...
Background: Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death. Objective: To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designe...
Algorithmic decision systems mediate access to healthcare, credit, employment and housing, yet individuals who experience adverse decisions face multi-stage barriers when seeking recourse. We formalize these barriers as a series-structured system with 11 empirically parameterized stages across three layers (data integration, data accuracy and institutional access) and prove that single-barrier interventions are bounded by baseline system success. Under baseline parameterization derived from fede...
Emergency Department triage is a critical decision-making process in which clinicians must rapidly assess patient acuity under high cognitive load and time pressure. We present ED-Triage-Agent (ETA), a multi-agent AI framework designed to augment clinical decision-making in Emergency Severity Index (ESI) classification through human-AI collaboration. The system operates in two phases: (1) autonomous patient intake via a conversational agent that collects structured sympto...
Wearable devices present transformative opportunities for personalized healthcare through continuous monitoring of digital biomarkers; however, individual variations in device wear time could mask or otherwise impact signal identification. Despite the widespread adoption of wearable devices in research, no comprehensive framework exists for understanding how wear time varies across populations or for addressing wear time-related biases in analysis. Using Fitbit data from 11,901 participants in t...
We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) P...
We propose a large-scale synthetic dataset that correlates structured background information aligned with the actual distribution of patients with cerebral infarction, nurse characteristics, and nurse-patient dialogues across diverse scenarios. Medical dialogue corpora are scarce due to privacy and access restrictions. Even when available, they primarily focus on physician-patient interactions and offer limited metadata (clinical covariates, staff characteristics, etc.). To address this gap, thi...
Health-Related Social Needs (HRSNs) significantly impact health outcomes, yet traditional care often fails to address them effectively. While conversational agents offer scalable support, their deployment is hindered by privacy risks and a lack of specialized training data for clinical applications. Synthetic data generation offers a solution to address this gap; standard pipelines often prompt LLMs using structured user personas, comprising demographics, constraints, and goals, to emulate dialo...
Background: Large language models show promise for clinical decision support, yet their propensity for hallucination--generating plausible but unsupported claims--poses substantial patient safety risks. Retrieval-augmented generation (RAG) is widely assumed to mitigate this problem by grounding outputs in retrieved documents, but this assumption remains inadequately tested in clinical contexts where information density, temporal complexity, and safety stakes are uniquely high. Methods: We develope...
Background: We present BODHI (Balanced, Open-minded, Diagnostic, Humble, and Inquisitive), an engineering framework for curiosity-driven and humble clinical decision support AI. Despite growing capabilities, large language models (LLMs) often express inappropriate confidence, conflating statistical pattern recognition with genuine medical understanding. BODHI addresses this through a dual-reflective architecture that: (1) decomposes epistemic uncertainty into task-specific dimensions, and (2) cons...
Artificial intelligence (AI) is increasingly permeating healthcare, from physician assistants to consumer applications. Because the opacity of AI algorithms hinders human interaction, explainable AI (XAI) aims to provide insight into AI decision-making, but evidence suggests XAI can paradoxically induce over-reliance or bias. We present results from two large-scale experiments (623 lay people; 153 primary care physicians, PCPs) combining a fairness-based diagnosis AI model and different XAI exp...
Background: Large language models (LLMs) show promise for clinical decision support, yet most validation studies evaluate single models, leaving questions about generalizability and vendor dependence unanswered. We assessed whether a standardized biomarker analysis framework maintains clinical-grade accuracy across multiple LLMs from independent providers. Methods: We developed a structured prompt-based framework for detecting eight clinical patterns (insulin resistance, diabetes, cardiovascular di...
Large language models (LLMs) are increasingly used in clinical workflows, yet requiring clinician review of every AI output negates the efficiency gains that motivate their adoption. We present SCOUT (Scalable Clinical Oversight via Uncertainty Triangulation), a model-agnostic meta-verification framework that selectively defers unreliable LLM predictions to clinicians by triangulating three orthogonal signals: model heterogeneity, stochastic inconsistency, and reasoning critique. In this retrosp...
Health behaviors such as physical activity and sleep affect mental health, but the effect of each health behavior varies substantially across individuals, limiting the usefulness of generic behavioral recommendations. We collected one year of continuous wearable and ecological momentary assessment data from 3,139 participants in the Intern Health Study (2018-2023), and examined individual-level associations between wearable-derived features and mood across the internship year. The behaviors asso...